due to gene expression, and that overrepresentation can be of biological importance
rather than a bias. The overrepresented sequences report is a table that lists each
overrepresented sequence with its count, its percentage of the total reads, and its
possible source. To save memory, only the first 200,000 reads in the FASTQ file are
checked; therefore, the list is not exhaustive and other overrepresented sequences may
escape detection. For each overrepresented sequence, FastQC searches a database of known
contaminants and reports the best match that is at least 20 bases long and has no more
than a single mismatch. A warning is issued if any sequence accounts for more than 0.1%
of the total reads, and a failure is reported if any sequence accounts for more than 1%
of the total. As shown in Figure 1.24, five overrepresented sequences were found, three
of which match contaminating adaptors and two of which have no hits. The count and
percentage indicate how prevalent each overrepresented sequence is. The first sequence
in the table accounts for 29.4% of the reads in the FASTQ file. This sequence clearly
originates from primer contamination and must be removed before analysis.
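The thresholds and the contaminant match rule described above can be illustrated with a minimal Python sketch. This is not FastQC's own code. The function names `overrepresented` and `hits_contaminant` are hypothetical, and the sketch assumes that a sequence first seen within the tracking limit continues to be counted for the rest of the file. The contaminant check only tests whether a qualifying hit exists; it does not rank hits to find the best one.

```python
from collections import Counter

WARN_FRACTION = 0.001   # more than 0.1% of all reads -> warning
FAIL_FRACTION = 0.01    # more than 1% of all reads -> failure
TRACK_LIMIT = 200_000   # new sequences are registered only within this many reads


def overrepresented(reads, track_limit=TRACK_LIMIT):
    """Return (sequence, count, percentage, status) rows, most frequent first.

    Assumption: a sequence registered within the first `track_limit` reads
    keeps being counted afterwards; sequences first seen later are ignored.
    """
    counts = Counter()
    total = 0
    for seq in reads:
        total += 1
        if total <= track_limit or seq in counts:
            counts[seq] += 1
    rows = []
    for seq, n in counts.most_common():
        fraction = n / total
        if fraction > FAIL_FRACTION:
            status = "fail"
        elif fraction > WARN_FRACTION:
            status = "warn"
        else:
            continue  # below the warning threshold, so not reported
        rows.append((seq, n, 100 * fraction, status))
    return rows


def hits_contaminant(seq, contaminant, min_len=20, max_mismatch=1):
    """True if any ungapped stretch of `min_len` bases in `seq` matches a
    stretch of `contaminant` with at most `max_mismatch` mismatches."""
    for i in range(len(seq) - min_len + 1):
        window = seq[i:i + min_len]
        for j in range(len(contaminant) - min_len + 1):
            target = contaminant[j:j + min_len]
            if sum(a != b for a, b in zip(window, target)) <= max_mismatch:
                return True
    return False
```

Calling `overrepresented` on an iterable of read sequences returns only the sequences that would trigger a warning or a failure, and each reported sequence can then be passed to `hits_contaminant` together with every entry of a contaminant list.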
1.5.11 Adapter Content
Full-length adaptor primers may form adaptor dimers that contaminate a significant
number of reads. The adaptor content graph shows the cumulative percentage count of
FIGURE 1.23 Sequence duplication levels (warning and failure).
FIGURE 1.24 Overrepresented sequences.